Incremental Record Linkage
نویسندگان
چکیده
Record linkage clusters records such that each cluster corresponds to a single distinct real-world entity. It is a crucial step in data cleaning and data integration. In the big data era, the velocity of data updates is often high, quickly making previous linkage results obsolete. This paper presents an end-to-end framework that can incrementally and efficiently update linkage results when data updates arrive. Our algorithms not only allow merging records in the updates with existing clusters, but also allow leveraging new evidence from the updates to fix previous linkage errors. Experimental results on three real and synthetic data sets show that our algorithms can significantly reduce linkage time without sacrificing linkage quality.
منابع مشابه
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lead to information loss available in other data sets. Hence, it is necessary to integrate scattered information to a comprehensive unique data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...
متن کاملDetermining Record Linkage Parameters Using an Iterative Logistic Regression Approach
1. Abstract. Fellegi and Sunter (1969) developed an algorithm for linking records. Each pair of records is scored on a field by field basis. If the records agree on a given field, a bonus is added to the score. If they disagree, a penalty is subtracted from the score. The records are designated as a match provided the score exceeds a certain threshold. In order to employ this method, one has to...
متن کاملTuning & Recommended Related Evolution Approaches for Distributed Databases
--Today’s databases are complex databases with duplicates. Due to complexity database we introduce the tuning and recommendation techniques. Tuning and recommendation process is important task in data integration task. Different existing system techniques like record matching, record linkage detects the same entities in single database. Deduplication removes the duplicates in single database. T...
متن کاملRecord Linkage I: Evaluation of Commercially Available Record Linkage Software for Use in NASS
Record linkage is an important technique in NASS for minimizing the presence of duplicate names on its list sampling frame of farm operators and agribusinesses. In the late 1970' s, NASS developed an automated record linkage system which runs on an IBM mainframe for this purpose. With changes in technology, the need has arisen for portability between platforms, integration with client/server te...
متن کاملA Decision Tree Based Record Linkage for Recommendation Systems
Record linkage merges all the records relating to the same entity from multiple datasets, at the entity level. It is the initial data preparation phase for most of the database projects. Traditionally one to one data linkage is performed among the entities of same type with common unique identifier. The proposed one to many and/or many to many record linkage method is able to link the entities ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- PVLDB
دوره 7 شماره
صفحات -
تاریخ انتشار 2014